Weakly Supervised Morphology Learning for Agglutinating Languages Using Small Training Sets

نویسندگان

  • Kseniya B. Shalonova
  • Bruno Golénia
چکیده

The paper describes a weakly supervised approach for decomposing words into all morphemes: stems, prefixes and suffixes, using wordforms with marked stems as training data. As we concentrate on under-resourced languages, the amount of training data is limited and we need some amount of supervision in the form of a small number of wordforms with marked stems. In the first stage we introduce a new Supervised Stem Extraction algorithm (SSE). Once stems have been extracted, an improved unsupervised segmentation algorithm GBUMS (GraphBased Unsupervised Morpheme Segmentation) is used to segment suffix or prefix sequences into individual suffixes and prefixes. The approach, experimentally validated on Turkish and isiZulu languages, gives high performance on test data and is comparable to a fully supervised method.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Multilingual Noise-Robust Supervised Morphological Analysis using the WordFrame Model

This paper presents the WordFrame model, a noiserobust supervised algorithm capable of inducing morphological analyses for languages which exhibit prefixation, suffixation, and internal vowel shifts. In combination with a näive approach to suffix-based morphology, this algorithm is shown to be remarkably effective across a broad range of languages, including those exhibiting infixation and part...

متن کامل

Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods

In this paper we consider the problem of labeling the languages of words in mixed-language documents. This problem is approached in a weakly supervised fashion, as a sequence labeling problem with monolingual text samples for training data. Among the approaches evaluated, a conditional random field model trained with generalized expectation criteria was the most accurate and performed consisten...

متن کامل

Composite Kernel Optimization in Semi-Supervised Metric

Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance as compared with traditional metrics. This has recently stimulated a considerable interest in the to...

متن کامل

Language Segmentation

Language segmentation consists in finding the boundaries where one language ends and another language begins in a text wrien in more than one language. is is important for all natural language processing tasks. e problem can be solved by training language models on language data. However, in the case of lowor no-resource languages, this is problematic. I therefore investigate whether unsuper...

متن کامل

Supervised Morphological Segmentation in a Low-Resource Learning Setting using Conditional Random Fields

We discuss data-driven morphological segmentation, in which word forms are segmented into morphs, the surface forms of morphemes. Our focus is on a lowresource learning setting, in which only a small amount of annotated word forms are available for model training, while unannotated word forms are available in abundance. The current state-of-art methods 1) exploit both the annotated and unannota...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010